Abstract

We give tight concentration bounds for mixtures of martingales that are simultaneously uniform over (a) mixture 
distributions, in a PAC-Bayes sense; and (b) all finite times. These bounds are proved in terms of the martingale 
variance, extending classical Bernstein inequalities, and sharpening and simplifying prior work. 


1 Introduction 

The concentration behavior of a martingale $M_t$, a discrete-time stochastic process with conditionally mean-zero
increments, is well known to have many applications in modeling sequential processes and algorithms, and so it is
of interest to analyze for applications in machine learning and statistics. It is a long-studied phenomenon 0; despite 
their mighty generality, martingales exhibit essentially the same well-understood concentration behavior as simple 
random walks. 

Even more powerful concentration results can be obtained by considering aggregates of many martingales. Though 
these too have long been studied asymptotically [8], their non-asymptotic study was only initiated by a recent paper of
Seldin et al. [6], which proves the sharpest previously known results on concentration of martingale mixtures, uniformly over the
mixing distribution in a “PAC-Bayes” sense which is essentially optimal for such bounds 0. This is motivated by and 
originally intended for applications in learning theory, as further discussed in that paper. 

In this manuscript, we simplify, strengthen, and subsume the results of [6]. While that paper follows classical
central-limit-theorem-type concentration results in focusing on an arbitrary fixed time, we instead leverage a recent
method of Balsubramani [1] to achieve concentration that is uniform over finite times, extending the law of the iterated
logarithm (LIL), with a rate at least as good as that of [6] and often far superior.

In short, our bounds on mixtures of martingales are uniform both over the mixing distribution in a PAC-Bayes 
sense, and over finite times, all simultaneously (and optimally). This has no precedent in the literature. 

1.1 Preliminaries 

To formalize this, consider a set of discrete-time stochastic processes $\{M_t(h)\}_{h \in \mathcal{H}}$, where $\mathcal{H}$ is possibly uncountable,
in a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, P)$.¹ Define the corresponding difference sequences $\xi_t(h) = M_t(h) - M_{t-1}(h)$
(which are $\mathcal{F}_t$-measurable for any $h, t$) and conditional variance processes $V_t(h) = \sum_{i=1}^{t} \mathbb{E}[\xi_i^2(h) \mid \mathcal{F}_{i-1}]$.

The mean of the processes at time $t$ w.r.t. any distribution $\rho$ over $\mathcal{H}$ (any $\rho \in \Delta(\mathcal{H})$, as we write) is denoted by
$\mathbb{E}_\rho[M_t] := \mathbb{E}_{h \sim \rho}[M_t(h)]$, with $\mathbb{E}_\rho[V_t]$ being defined similarly. For brevity, we drop the index $h$ when it is clear from
context.

Also recall the following standard definitions from martingale theory. For any $h \in \mathcal{H}$, a martingale $M_t$ (resp.
supermartingale, submartingale) has $\mathbb{E}[\xi_t \mid \mathcal{F}_{t-1}] = 0$ (resp. $\le 0$, $\ge 0$) for all $t$. A stopping time $\tau$ is a function on
$\Omega$ such that $\{\tau \le t\} \in \mathcal{F}_t$ for all $t$; notably, $\tau$ can be infinite with positive probability ([4]).

It is now possible to state our main result. 

1 As in [1], we just consider discrete time for convenience; the core results and proofs in this paper extend to continuous time, as well as
to other generalizations discussed in that paper.
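Theorem 1 (PAC-Bayes Martingale Bernstein LIL Concentration). Let $\{M_t(h)\}_{h \in \mathcal{H}}$ be a set of martingales and fix a
distribution $\pi \in \Delta(\mathcal{H})$ and a $\delta < 1$. Suppose the difference sequence is uniformly bounded: $|M_t(h) - M_{t-1}(h)| \le e^2$
for all $t \ge 1$ and $h \in \mathcal{H}$ w.p. 1. Then with probability $\ge 1 - \delta$, the following is true for all $\rho \in \Delta(\mathcal{H})$.

Suppose $\tau_0(\rho) = \min \left\{ s : 2(e-2)\,\mathbb{E}_\rho[V_s] \ge \frac{2}{\lambda_0^2} \left( \ln \frac{4}{\delta} + \mathrm{KL}(\rho \,\|\, \pi) \right) \right\}$, where $\lambda_0 := \frac{1}{e^2 (1 + \sqrt{1/3})}$. For all $t \ge \tau_0(\rho)$ simultaneously,

\[ |\mathbb{E}_\rho[M_t]| \le \frac{2(e-2)}{e^2 (1 + \sqrt{1/3})}\, \mathbb{E}_\rho[V_t] \]

and

\[ |\mathbb{E}_\rho[M_t]| \le \sqrt{ 6(e-2)\,\mathbb{E}_\rho[V_t] \left( 2 \ln\ln \left( \frac{3(e-2)\,\mathbb{E}_\rho[V_t]}{|\mathbb{E}_\rho[M_t]|} \right) + \ln \frac{2}{\delta} + \mathrm{KL}(\rho \,\|\, \pi) \right) } \]

(This bound is implicit in $|\mathbb{E}_\rho[M_t]|$, but notice that either $|\mathbb{E}_\rho[M_t]| \le 1$ or the iterated-logarithm term can be
simply treated as $2 \ln\ln(3(e-2)\,\mathbb{E}_\rho[V_t])$, making the bound explicit.)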
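To get a feel for the scale of the explicit form of the bound, the following minimal Python sketch evaluates it numerically. It assumes the explicit bound reads max(1, sqrt(6(e-2) v (2 ln ln(3(e-2) v) + ln(2/δ) + KL))), with v standing for the mixture variance; the function name `explicit_bound` and its arguments are illustrative and not part of the paper.

```python
import math

C = math.e - 2.0  # the constant (e - 2) appearing in the bound


def explicit_bound(v, delta, kl):
    """Explicit deviation bound for a mixture with variance v (assumed form).

    v     : value of E_rho[V_t], assumed large enough that ln ln is defined
    delta : failure probability, 0 < delta < 1
    kl    : KL(rho || pi), nonnegative
    """
    arg = 3.0 * C * v
    if arg <= math.e:
        raise ValueError("variance too small for the iterated logarithm")
    inner = 2.0 * math.log(math.log(arg)) + math.log(2.0 / delta) + kl
    # Either the deviation is at most 1, or the square-root bound applies.
    return max(1.0, math.sqrt(6.0 * C * v * inner))


b_small = explicit_bound(1000.0, 0.05, 0.0)
b_large = explicit_bound(4000.0, 0.05, 0.0)
b_kl = explicit_bound(1000.0, 0.05, 2.0)
```

Quadrupling the variance roughly doubles the bound (the iterated logarithm adds only a slowly growing factor), while a larger KL term loosens it additively inside the square root.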

As we mentioned, Theorem 1 is uniform over $\rho$ and $t$, allowing us to track mixtures of martingales tightly as they
evolve. The martingales indexed by $\mathcal{H}$ can be highly dependent on each other, as they share a probability space. For
instance, $M_{t_0}(h_0)$ can depend in arbitrary fashion on $\{M_{t < t_0}(h)\}_{h \ne h_0}$; it is only required to satisfy the martingale
property for $h_0$, as further discussed in [6]. So these inequalities have found use in analyzing sequentially dependent
processes such as those in reinforcement learning and bandit algorithms, where the choice of prior $\pi$ can be tailored
to the problem Q and the posterior can be updated based on information learned up to that time. 

The method of proof is essentially that used in [1]. Our main observation in this manuscript is that this proof
technique is, in a technical sense, quite complementary to the fundamental method used in all PAC-Bayes analysis [2].
This allows us to prove our results in a sharper and more streamlined way than previous work [6].

1.2 Discussion 

Let us elaborate on these claims. Theorem 1 can be compared directly to the following PAC-Bayes Bernstein bound
from [6] that holds for a fixed time:

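Theorem 2 (Restatement of Thm. 8 from [6]). Fix any $t$. For $\rho$ such that $\mathrm{KL}(\rho \,\|\, \pi)$ is sufficiently low compared to
$\mathbb{E}_\rho[V_t]$,

\[ |\mathbb{E}_\rho[M_t]| \le \sqrt{ (1+e)^2 (e-2)\,\mathbb{E}_\rho[V_t] \left( \ln\ln t + \ln \frac{2}{\delta} + \mathrm{KL}(\rho \,\|\, \pi) \right) } \]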

This bound is inferior to Theorem 1 in two significant ways: it holds for a fixed time rather than uniformly over
finite times, and has an iterated-logarithm term of $\ln\ln t$ rather than $\ln\ln V_t$. The latter is a very significant difference
when $V_t \ll t$, which is precisely when Bernstein inequalities would be preferred to more basic inequalities like
Chernoff bounds. 

Put differently, our non-asymptotic result, like those of Balsubramani [1], adapts correctly to the scale of the
problem. We say “correctly” because Theorem 1 is optimal by the (asymptotic) martingale LIL, e.g. the seminal result
of Stout [7]; this is true non-asymptotically too, by the main anti-concentration bound of [1]. All these optimality
results are for a single martingale, but suffice for the PAC-Bayes case as well; and the additive $\mathrm{KL}(\rho \,\|\, \pi)$ cost of
uniformity over $\rho$ is unimprovable in general also, by standard PAC-Bayes arguments.

1.2.1 Proof Overview 

Our method follows that of Balsubramani [1], departing from the more traditionally learning-theoretic techniques used
in [6]. We embark on the proof by introducing a standard exponential supermartingale construction that holds for any
of the martingales $\{M_t(h)\}_{h \in \mathcal{H}}$.

2 The actual statement of the theorem in [6], though not significantly different, is more complicated because of a few inconvenient artifacts of
that paper’s more complicated analysis, none of which arise in our analysis.
Lemma 3. Suppose $|\xi_t| \le e^2$ a.s. for all $t$. Then for any $h \in \mathcal{H}$, the process $X_t^\lambda(h) := \exp\left( \lambda M_t(h) - \lambda^2 (e-2) V_t(h) \right)$
is a supermartingale in $t$ for any $\lambda \in \left[ -\frac{1}{e^2}, \frac{1}{e^2} \right]$.

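Proof. It can be checked that $e^x \le 1 + x + (e-2)x^2$ for $x \le 1$. Then for any $\lambda \in \left[ -\frac{1}{e^2}, \frac{1}{e^2} \right]$ and $t \ge 1$,

\[ \mathbb{E}[\exp(\lambda \xi_t) \mid \mathcal{F}_{t-1}] \le 1 + \lambda\,\mathbb{E}[\xi_t \mid \mathcal{F}_{t-1}] + \lambda^2 (e-2)\,\mathbb{E}[\xi_t^2 \mid \mathcal{F}_{t-1}] \]

\[ = 1 + \lambda^2 (e-2)\,\mathbb{E}[\xi_t^2 \mid \mathcal{F}_{t-1}] \le \exp\left( \lambda^2 (e-2)\,\mathbb{E}[\xi_t^2 \mid \mathcal{F}_{t-1}] \right) \]

using the martingale property on $\mathbb{E}[\xi_t \mid \mathcal{F}_{t-1}]$. Therefore, $\mathbb{E}\left[ \exp\left( \lambda \xi_t - \lambda^2 (e-2)\,\mathbb{E}[\xi_t^2 \mid \mathcal{F}_{t-1}] \right) \mid \mathcal{F}_{t-1} \right] \le 1$, so
$\mathbb{E}[X_t^\lambda \mid \mathcal{F}_{t-1}] \le X_{t-1}^\lambda$. □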
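The two facts used in this proof can be checked numerically. The Python sketch below is only a sanity check under stated assumptions: it tests the inequality e^x <= 1 + x + (e-2)x^2 on a grid over [-10, 1], and it evaluates the one-step supermartingale factor for a hypothetical increment that equals +c or -c with probability 1/2 each (so its conditional variance is c^2). All names are illustrative.

```python
import math

C = math.e - 2.0        # the constant (e - 2) from the proof
B = math.e ** 2         # assumed almost-sure bound on |xi_t|
LAM_MAX = 1.0 / B       # |lambda| <= e^{-2} keeps lambda * xi_t <= 1


def quadratic_gap(x):
    """1 + x + (e-2) x^2 - e^x, which should be nonnegative for x <= 1."""
    return 1.0 + x + C * x * x - math.exp(x)


def one_step_factor(lam, c):
    """E[exp(lam * xi - lam^2 (e-2) c^2)] for xi = +c or -c w.p. 1/2 each."""
    return math.cosh(lam * c) * math.exp(-lam * lam * C * c * c)


# Grid check of the elementary inequality on [-10, 1].
xs = [-10.0 + 11.0 * i / 2000 for i in range(2001)]
min_gap = min(quadratic_gap(x) for x in xs)

# Largest one-step factor over the allowed range of lambda, at the extreme c.
lams = [LAM_MAX * (i - 50) / 50 for i in range(101)]
max_factor = max(one_step_factor(lam, B) for lam in lams)
```

The gap vanishes at x = 0 and x = 1, and the one-step factor equals 1 only at λ = 0, which is what the supermartingale property requires.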

The classical martingale Bernstein inequality for a given $h$ and fixed time $t$ can be proved by using Markov’s
inequality with $\mathbb{E}[X_t^\lambda]$, where $\lambda^* \propto \frac{|M_t|}{V_t}$ is tuned for the tightest bound, and can be thought of as setting the relative
scale of variation being measured.
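As a rough illustration of this tuning step, the Python sketch below assumes $M_0 = 0$ (so the exponential supermartingale starts at 1) and applies Markov's inequality to get P(M_t >= a and V_t <= v) <= exp(-λa + λ^2 (e-2) v) for each fixed λ in (0, e^{-2}]. Minimizing over λ gives λ* = a / (2(e-2)v), capped at e^{-2}. The deviation level `a`, the variance level `v`, and the function names are hypothetical and chosen only for this example.

```python
import math

C = math.e - 2.0
LAM_MAX = math.exp(-2.0)   # largest lambda allowed by the supermartingale lemma


def tail_bound(lam, a, v):
    """Markov bound on P(M_t >= a, V_t <= v) for one fixed lambda."""
    return math.exp(-lam * a + lam * lam * C * v)


def best_lambda(a, v):
    """Minimizer of the exponent over (0, e^{-2}]: proportional to a / v."""
    return min(a / (2.0 * C * v), LAM_MAX)


def bernstein_tail(a, v):
    """Tail bound at the tuned scale parameter."""
    return tail_bound(best_lambda(a, v), a, v)


# Compare the tuned bound against a brute-force search over lambda.
grid = [LAM_MAX * (i + 1) / 1000 for i in range(1000)]
grid_best = min(tail_bound(lam, 10.0, 100.0) for lam in grid)
tuned = bernstein_tail(10.0, 100.0)
```

When the cap is inactive the tuned bound is exp(-a^2 / (4(e-2)v)), the familiar sub-Gaussian regime of Bernstein's inequality. The best scale depends on the ratio a / v, which is why a random time or an unknown mixture makes a single fixed λ insufficient.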

The proof technique of this paper and its advantages over previous work are best explained by examining how to
set the scale parameter $\lambda$.


1.2.2 Choosing the Scale Parameter 

On a high level, the main idea of Balsubramani [1] is to average over a random choice of the scale parameter $\lambda$ in the
supermartingale $X_t^\lambda$. This allows manipulation of a stopped version of $X_t^\lambda$, i.e. $X_\tau^\lambda$ for a particular stopping time $\tau$.
So the averaging technique in [1] can be thought of as using “many values of $\lambda$ at once,” which is necessary when
dealing with the stopped process because $\tau$ is random, and so is $\frac{|M_\tau|}{V_\tau}$.

All existing PAC-Bayes analyses achieve uniformity in $\rho$ through the Donsker-Varadhan variational characterization
of the KL divergence:

Lemma 4 (Donsker-Varadhan Lemma ([3])). Suppose $\rho$ and $\pi$ are probability distributions over $\mathcal{H}$, and let $f(\cdot) : \mathcal{H} \to \mathbb{R}$
be a measurable function. Then

\[ \mathbb{E}_\rho[f(h)] \le \mathrm{KL}(\rho \,\|\, \pi) + \ln\left( \mathbb{E}_\pi\left[ e^{f(h)} \right] \right) \]
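The lemma is easy to verify numerically when the index set is finite. The Python sketch below is an illustration only: it draws a hypothetical prior and function on five points, checks that KL(ρ‖π) + ln E_π[e^f] - E_ρ[f] is nonnegative for many random posteriors ρ, and checks that the gap closes for the Gibbs posterior ρ ∝ π e^f. The helper names are not from the paper.

```python
import math
import random


def kl(rho, pi):
    """KL divergence between two distributions on a finite set."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)


def dv_gap(rho, pi, f):
    """KL(rho||pi) + ln E_pi[e^f] - E_rho[f]; nonnegative by the lemma."""
    mean_f = sum(r * fh for r, fh in zip(rho, f))
    log_mgf = math.log(sum(p * math.exp(fh) for p, fh in zip(pi, f)))
    return kl(rho, pi) + log_mgf - mean_f


def gibbs(pi, f):
    """Posterior proportional to pi(h) * exp(f(h)), where equality holds."""
    w = [p * math.exp(fh) for p, fh in zip(pi, f)]
    z = sum(w)
    return [x / z for x in w]


def random_dist(rng, n):
    """A random strictly positive probability vector of length n."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]


rng = random.Random(0)
pi = random_dist(rng, 5)
f = [rng.uniform(-2.0, 2.0) for _ in range(5)]
gaps = [dv_gap(random_dist(rng, 5), pi, f) for _ in range(1000)]
gibbs_gap = dv_gap(gibbs(pi, f), pi, f)
```

Because the inequality holds for every ρ at once with the same right-hand side up to the KL term, it can be applied pointwise to a data-dependent ρ, which is the property the PAC-Bayes argument relies on.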


This introduces a $\mathrm{KL}(\rho \,\|\, \pi)$ term into the bounding of $X_\tau^\lambda$. However, the optimum $\lambda^*$ is then dependent on the
unknown $\rho$. The solution adopted by existing PAC-Bayes martingale bounds ([6] and variants) is again to use “many
values of $\lambda$ at once.” In prior work, this is done explicitly by taking a union bound over a grid of carefully chosen $\lambda$s.

Our main technical contribution is to recognize the similarity between these two problems, and to use the (tight)
stochastic choice of $\lambda$ in [1] as a solution to both problems at once, achieving the optimal bound of Theorem 1.


2 Proof of Theorem 1


We now give the complete proof of Theorem 1, following the presentation of [1] closely.

For the rest of this section, define $U_t := 2(e-2)V_t$, $k := \frac{1}{3}$, and $\lambda_0 := \frac{1}{e^2(1+\sqrt{k})}$. As in [1], our proof invokes
the Optional Stopping Theorem from martingale theory, in particular a version for nonnegative supermartingales that
neatly exploits their favorable convergence properties:


Theorem 5 (Optional Stopping for Nonnegative Supermartingales ([4], Theorem 5.7.6)). If $M_t$ is a nonnegative
supermartingale and $\tau$ a (possibly infinite) stopping time, $\mathbb{E}[M_\tau] \le \mathbb{E}[M_0]$.


We also use the exponential supermartingale construction of Lemma 3, which we assume to hold for $M_t(h)$ $\forall h \in \mathcal{H}$,
since they are all martingales whose concentration we require.

Our goal is to control $\mathbb{E}_\rho[M_t]$ in terms of $\mathbb{E}_\rho[U_t]$, so it is tempting to try to show that $e^{\lambda \mathbb{E}_\rho[M_t] - \frac{\lambda^2}{2} \mathbb{E}_\rho[U_t]}$ is an exponential
supermartingale. However, this is not generally true; and even if it were, it would only control $\mathbb{E}\left[ e^{\lambda \mathbb{E}_\rho[M_\tau] - \frac{\lambda^2}{2} \mathbb{E}_\rho[U_\tau]} \right]$
for a fixed $\rho$, not in a PAC-Bayes sense.
We instead achieve uniform control over $\rho$ by using the Donsker-Varadhan lemma to mediate the Optional Stopping
Theorem every time it is used in Balsubramani’s proof [1] of the single-martingale case. This is fully captured in the
following key result, a powerful extension of a standard moment-generating function bound that is uniform in $\rho$ and
has enough freedom (an arbitrary stopping time $\tau$) to be converted into a time-uniform bound.

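Lemma 6. Choose any probability distribution $\pi$ over $\mathcal{H}$. Then for any stopping time $\tau$ and $\lambda \in \left[ -\frac{1}{e^2}, \frac{1}{e^2} \right]$, simultaneously
for all distributions $\rho$ over $\mathcal{H}$,

\[ \mathbb{E}\left[ e^{\lambda \mathbb{E}_\rho[M_\tau] - \frac{\lambda^2}{2} \mathbb{E}_\rho[U_\tau]} \right] \le e^{\mathrm{KL}(\rho \,\|\, \pi)} \]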


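Proof. Using Lemma 4 with the function $f(h) = \lambda M_\tau(h) - \frac{\lambda^2}{2} U_\tau(h)$, and exponentiating both sides, we have for all
posterior distributions $\rho \in \Delta(\mathcal{H})$ that

\[ e^{\lambda \mathbb{E}_\rho[M_\tau] - \frac{\lambda^2}{2} \mathbb{E}_\rho[U_\tau]} \le e^{\mathrm{KL}(\rho \,\|\, \pi)}\, \mathbb{E}_\pi\left[ e^{\lambda M_\tau - \frac{\lambda^2}{2} U_\tau} \right] \qquad (1) \]

Therefore,

\[ \mathbb{E}\left[ e^{\lambda \mathbb{E}_\rho[M_\tau] - \frac{\lambda^2}{2} \mathbb{E}_\rho[U_\tau]} \right] \overset{(a)}{\le} e^{\mathrm{KL}(\rho \,\|\, \pi)}\, \mathbb{E}_\pi\left[ \mathbb{E}\left[ e^{\lambda M_\tau - \frac{\lambda^2}{2} U_\tau} \right] \right] \overset{(b)}{\le} e^{\mathrm{KL}(\rho \,\|\, \pi)} \]

where (a) is from (1) and Tonelli’s Theorem, and (b) is by Lemma 3 and Optional Stopping (Thm. 5).

□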
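The content of this lemma is that the posterior may depend on the data. The Python sketch below illustrates this by exact enumeration in a toy setting that is an assumption of this example, not a construction from the paper: three martingales driven by the same fair coin flips with hypothetical predictable bet sizes, a fixed time T in place of a general stopping time, and for each outcome the Gibbs posterior, which by the Donsker-Varadhan lemma maximizes the exponent minus the KL term. The average of exp(λE_ρ[M_T] - (λ^2/2)E_ρ[U_T] - KL(ρ‖π)) should then be at most 1.

```python
import itertools
import math

C = math.e - 2.0
T = 8          # fixed horizon used in place of a general stopping time
LAM = 0.1      # any value in [-e^{-2}, e^{-2}]

# Hypothetical predictable bet sizes: the scale of the next increment may
# depend on the previous coin flip (0 means "no flip yet"). All are <= e^2.
SCALES = [
    lambda prev: 1.0,
    lambda prev: 2.0 if prev > 0 else 0.5,
    lambda prev: 3.0 if prev < 0 else 1.0,
]
PRIOR = [0.5, 0.3, 0.2]


def run_path(flips):
    """Return (M_T(h), U_T(h)) for every h along one coin-flip path."""
    out = []
    for scale in SCALES:
        m, v, prev = 0.0, 0.0, 0
        for eps in flips:
            c = scale(prev)
            m += c * eps      # martingale increment c * eps
            v += c * c        # its conditional variance given the past
            prev = eps
        out.append((m, 2.0 * C * v))
    return out


def kl(rho, pi):
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)


total = 0.0
for flips in itertools.product((-1, 1), repeat=T):
    scores = [LAM * m - 0.5 * LAM ** 2 * u for m, u in run_path(flips)]
    weights = [p * math.exp(s) for p, s in zip(PRIOR, scores)]
    rho = [w / sum(weights) for w in weights]   # data-dependent posterior
    mixed = sum(r * s for r, s in zip(rho, scores))
    total += math.exp(mixed - kl(rho, PRIOR)) / 2 ** T
```

Here `total` stays below 1 even though the posterior is chosen after seeing the whole path, because the Gibbs choice turns the exponent into the log of a prior-average of supermartingales.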


Just as a bound on the moment-generating function of a random variable is the key to proving tight Hoeffding and
Bernstein concentration of that variable, Lemma 6 is, exactly analogously, the key tool used to prove Theorem 1.


2.1 A PAC-Bayes Uniform Law of Large Numbers 

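First, we define the stopping time $\tau_0(\rho) := \min \left\{ t : \mathbb{E}_\rho[U_t] \ge \frac{2}{\lambda_0^2} \left( \ln \frac{2}{\delta} + \mathrm{KL}(\rho \,\|\, \pi) \right) \right\}$ and the following “good”
event:

\[ B_\delta = \left\{ \omega \in \Omega : \forall \rho \in \Delta(\mathcal{H}),\ \frac{|\mathbb{E}_\rho[M_t]|}{\mathbb{E}_\rho[U_t]} \le \lambda_0 \quad \forall t \ge \tau_0(\rho) \right\} \qquad (2) \]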


Our first result introduces the reader to our main proof technique; it is a generalization of the law of large numbers 
(LLN) to our PAC-Bayes martingale setting. 

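Theorem 7. Fix any $\delta > 0$. With probability $\ge 1 - \delta$, the following is true for all $\rho$ over $\mathcal{H}$: for all $t \ge \tau_0(\rho)$,

\[ \frac{|\mathbb{E}_\rho[M_t]|}{\mathbb{E}_\rho[U_t]} \le \lambda_0 \]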

To prove this, we first manipulate Lemma 6 so that it is in terms of $|\mathbb{E}_\rho[M_\tau]|$.

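Lemma 8. Choose any prior $\pi \in \Delta(\mathcal{H})$. For any stopping time $\tau$ and all distributions $\rho$ over $\mathcal{H}$,

\[ \mathbb{E}\left[ \exp\left( \lambda_0 |\mathbb{E}_\rho[M_\tau]| - \frac{\lambda_0^2}{2} \mathbb{E}_\rho[U_\tau] \right) \right] \le 2 e^{\mathrm{KL}(\rho \,\|\, \pi)} \]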


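Proof. Lemma 6 describes the behavior of the process $X_t^\lambda = e^{\lambda \mathbb{E}_\rho[M_t] - \frac{\lambda^2}{2} \mathbb{E}_\rho[U_t]}$. Define $Y_t$ to be the mean of $X_t^\lambda$ with
$\lambda$ being set stochastically: $\lambda \in \{-\lambda_0, \lambda_0\}$ with probability $\frac{1}{2}$ each. After marginalizing over $\lambda$, the resulting process
is

\[ Y_t = \frac{1}{2} \exp\left( \lambda_0 \mathbb{E}_\rho[M_t] - \frac{\lambda_0^2}{2} \mathbb{E}_\rho[U_t] \right) + \frac{1}{2} \exp\left( -\lambda_0 \mathbb{E}_\rho[M_t] - \frac{\lambda_0^2}{2} \mathbb{E}_\rho[U_t] \right) \ge \frac{1}{2} \exp\left( \lambda_0 |\mathbb{E}_\rho[M_t]| - \frac{\lambda_0^2}{2} \mathbb{E}_\rho[U_t] \right) \qquad (3) \]

Now take $\tau$ to be any stopping time as in the lemma statement. Then $\mathbb{E}\left[ \exp\left( \lambda_0 \mathbb{E}_\rho[M_\tau] - \frac{\lambda_0^2}{2} \mathbb{E}_\rho[U_\tau] \right) \right] \le e^{\mathrm{KL}(\rho \,\|\, \pi)}$
by Lemma 6. Similarly, $\mathbb{E}\left[ X_\tau^{\lambda = -\lambda_0} \right] \le e^{\mathrm{KL}(\rho \,\|\, \pi)}$.

So $\mathbb{E}[Y_\tau] = \frac{1}{2} \left( \mathbb{E}\left[ X_\tau^{\lambda = -\lambda_0} \right] + \mathbb{E}\left[ X_\tau^{\lambda = \lambda_0} \right] \right) \le e^{\mathrm{KL}(\rho \,\|\, \pi)}$. Combining this with (3) gives the result. □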
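A particular setting of $\tau$ extracts the desired uniform LLN bound from Lemma 8.

Proof of Theorem 7. Define the stopping time

\[ \tau = \min \left\{ t : \exists \rho \in \Delta(\mathcal{H}) \text{ s.t. } t \ge \tau_0(\rho) \text{ and } \frac{|\mathbb{E}_\rho[M_t]|}{\mathbb{E}_\rho[U_t]} > \lambda_0 \right\} \]

Then it suffices to prove that $P(\tau < \infty) \le \delta$.

On the event $\{\tau < \infty\}$, we have for some $\rho$ that $\frac{|\mathbb{E}_\rho[M_\tau]|}{\mathbb{E}_\rho[U_\tau]} > \lambda_0$ by definition of $\tau$. Therefore, using Lemma 8, we
can conclude that for this $\rho$,

\[ 2 e^{\mathrm{KL}(\rho \,\|\, \pi)} \overset{(a)}{\ge} \mathbb{E}\left[ \exp\left( \lambda_0 |\mathbb{E}_\rho[M_\tau]| - \frac{\lambda_0^2}{2} \mathbb{E}_\rho[U_\tau] \right) \,\middle|\, \tau < \infty \right] P(\tau < \infty) \overset{(b)}{\ge} \frac{2}{\delta} e^{\mathrm{KL}(\rho \,\|\, \pi)} P(\tau < \infty) \]

where (a) is by Lemma 8 and the nonnegativity of the exponential, and (b) uses $|\mathbb{E}_\rho[M_\tau]| > \lambda_0 \mathbb{E}_\rho[U_\tau]$ on $\{\tau < \infty\}$ together with
$\mathbb{E}_\rho[U_\tau] \ge \mathbb{E}_\rho[U_{\tau_0(\rho)}] \ge \frac{2}{\lambda_0^2} \ln\left( \frac{2}{\delta} e^{\mathrm{KL}(\rho \,\|\, \pi)} \right)$. Therefore, $P(\tau < \infty) \le \delta$, as desired.

□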


2.2 Proof of Theorem 1

For any event $E \subseteq \Omega$ of nonzero measure, let $\mathbb{E}_E[\cdot]$ denote the expectation restricted to $E$, i.e. $\mathbb{E}_E[f] = \frac{1}{P(E)} \int_E f(\omega) P(d\omega)$
for a measurable function $f$ on $\Omega$. Similarly, dub the associated measure $P_E$, where for any event $\Xi \subseteq \Omega$ we have
$P_E(\Xi) = \frac{P(E \cap \Xi)}{P(E)}$.

Theorem 7 shows that $P(B_\delta) \ge 1 - \delta$. It is consequently easy to observe that the corresponding restricted outcome
space can still be used to control expectations.

Proposition 9. For any nonnegative r.v. $X$, we have $\mathbb{E}_{B_\delta}[X] \le \frac{1}{1 - \delta} \mathbb{E}[X]$.

Proof. Using Thm. 7, $\mathbb{E}[X] = \mathbb{E}_{B_\delta}[X]\, P(B_\delta) + \mathbb{E}_{B_\delta^c}[X]\, P(B_\delta^c) \ge \mathbb{E}_{B_\delta}[X]\,(1 - \delta)$. □

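Just as in [1], the idea of the main proof is to choose $\lambda$ stochastically from a probability space $(\Omega_\lambda, \mathcal{F}_\lambda, P_\lambda)$ such
that $P_\lambda(d\lambda) = \frac{d\lambda}{|\lambda| \left( \ln \frac{1}{|\lambda|} \right)^2}$ on $\lambda \in [-e^{-2}, e^{-2}] \setminus \{0\}$. The parameter $\lambda$ is chosen independently of $\xi_1, \xi_2, \ldots$, so
that $X_t^\lambda$ is defined on the product space. Write $\mathbb{E}^\lambda[\cdot]$ to denote the expectation with respect to $(\Omega_\lambda, \mathcal{F}_\lambda, P_\lambda)$.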
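The mixing distribution over the scale parameter can be made concrete with a short Python sketch. It assumes the density is 1 / (|λ| ln^2(1/|λ|)) on [-e^{-2}, e^{-2}] without 0; under that assumption the mass of (0, λ] is 1 / ln(1/λ), which equals 1/2 at λ = e^{-2}, so the density integrates to 1 over both signs. The sampler inverts this expression. Function names are illustrative only.

```python
import math
import random

LAM_MAX = math.exp(-2.0)


def density(lam):
    """Assumed density of the scale parameter at lam != 0."""
    a = abs(lam)
    return 1.0 / (a * math.log(1.0 / a) ** 2)


def mass_between(a, b):
    """Exact P(a <= lambda <= b) for 0 < a <= b <= e^{-2}."""
    return 1.0 / math.log(1.0 / b) - 1.0 / math.log(1.0 / a)


def sample_lambda(rng):
    """Draw lambda by inverting the mass function u = 1 / ln(1/|lambda|)."""
    u = 0.5 * (1.0 - rng.random())       # uniform on (0, 1/2]
    magnitude = math.exp(-1.0 / u)       # solves 1 / ln(1/magnitude) = u
    return magnitude if rng.random() < 0.5 else -magnitude


def midpoint_integral(a, b, n=50000):
    """Midpoint-rule integral of the density over [a, b]."""
    h = (b - a) / n
    return h * sum(density(a + (i + 0.5) * h) for i in range(n))


rng = random.Random(0)
draws = [sample_lambda(rng) for _ in range(20000)]
frac_large = sum(1 for x in draws if abs(x) >= 1e-3) / len(draws)
```

The distribution places substantial mass at every scale near zero (its mass decays only logarithmically), which is what lets a single average over λ stand in for a union bound over a grid of scales.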

To be consistent with previous notation, we continue to write $\mathbb{E}[\cdot]$ to denote the expectation w.r.t. the original
probability space $(\Omega, \mathcal{F}, P)$. As mentioned earlier, we use subscripts for expectations conditioned on events in this
space, e.g. $\mathbb{E}_{B_\delta}[X]$. (As an example, $\mathbb{E}_\Omega[\cdot] = \mathbb{E}[\cdot]$.)

Before going on to prove the main theorem, we need one more result that controls the effect of averaging over $\lambda$
as above.

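Lemma 10. For all $\rho \in \Delta(\mathcal{H})$ and any $\delta$, the following is true: for any stopping time $\tau \ge \tau_0(\rho)$,

\[ \mathbb{E}_{B_\delta}\left[ \mathbb{E}^\lambda\left[ e^{\lambda \mathbb{E}_\rho[M_\tau] - \frac{\lambda^2}{2} \mathbb{E}_\rho[U_\tau]} \right] \right] \ge \mathbb{E}_{B_\delta}\left[ \frac{ 2 \exp\left( \frac{\mathbb{E}_\rho[M_\tau]^2}{2\,\mathbb{E}_\rho[U_\tau]} (1 - k) \right) }{ \ln^2\left( \frac{\mathbb{E}_\rho[U_\tau]}{(1 - \sqrt{k})\,|\mathbb{E}_\rho[M_\tau]|} \right) } \right] \]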


Lemma 10 is precisely analogous to Lemma 13 in [1] and proved using exactly the same calculations, so its proof
is omitted here. Now we can prove Theorem 1.


Proof of Theorem 1. The proof follows precisely the same method as that of Balsubramani [1], but with a more nuanced
setting of the stopping time $\tau$. To define it, we first for convenience define the deterministic function

\[ \zeta_t(\rho) = \sqrt{ \frac{2\,\mathbb{E}_\rho[U_t]}{1 - k} \ln\left( \frac{ 2 \ln^2\left( \frac{\mathbb{E}_\rho[U_t]}{(1 - \sqrt{k})\,|\mathbb{E}_\rho[M_t]|} \right) e^{\mathrm{KL}(\rho \,\|\, \pi)} }{ \delta } \right) } \]
Now we define the stopping time

\[ \tau = \min \Big\{ t : \exists \rho \in \Delta(\mathcal{H}) \text{ s.t. } t \ge \tau_0(\rho) \text{ and } \big[ |\mathbb{E}_\rho[M_t]| > \lambda_0 \mathbb{E}_\rho[U_t] \text{ or } \big( |\mathbb{E}_\rho[M_t]| \le \lambda_0 \mathbb{E}_\rho[U_t] \text{ and } |\mathbb{E}_\rho[M_t]| > \zeta_t(\rho) \big) \big] \Big\} \]

The rest of the proof shows that $P(\tau = \infty) \ge 1 - \delta$, and involves nearly identical calculations to the main proof of
[1] (with $\mathbb{E}_\rho[M_t]$, $\mathbb{E}_\rho[U_t]$, $B_\delta$ replacing what that paper writes as $M_t$, $U_t$, $A_\delta$).

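It suffices to prove that $P(\tau = \infty) \ge 1 - \delta$. On the event $\{\tau < \infty\} \cap B_{\delta/2}$, by definition there exists a $\rho$ s.t.

\[ |\mathbb{E}_\rho[M_\tau]| > \zeta_\tau(\rho) = \sqrt{ \frac{2\,\mathbb{E}_\rho[U_\tau]}{1 - k} \ln\left( \frac{ 2 \ln^2\left( \frac{\mathbb{E}_\rho[U_\tau]}{(1 - \sqrt{k})\,|\mathbb{E}_\rho[M_\tau]|} \right) e^{\mathrm{KL}(\rho \,\|\, \pi)} }{ \delta } \right) } \]

I.e. a $\rho$ such that

\[ \frac{ 2 \exp\left( \frac{\mathbb{E}_\rho[M_\tau]^2}{2\,\mathbb{E}_\rho[U_\tau]} (1 - k) \right) }{ \ln^2\left( \frac{\mathbb{E}_\rho[U_\tau]}{(1 - \sqrt{k})\,|\mathbb{E}_\rho[M_\tau]|} \right) } > \frac{4}{\delta} e^{\mathrm{KL}(\rho \,\|\, \pi)} \qquad (4) \]

Consider this $\rho$. Using the nonnegativity of the left-hand side of (4) on $B_{\delta/2}$ and letting $Z_\tau^\lambda := e^{\lambda \mathbb{E}_\rho[M_\tau] - \frac{\lambda^2}{2} \mathbb{E}_\rho[U_\tau]}$,

\[ 2 e^{\mathrm{KL}(\rho \,\|\, \pi)} \overset{(a)}{\ge} \frac{ \mathbb{E}^\lambda\left[ \mathbb{E}\left[ Z_\tau^\lambda \right] \right] }{ 1 - \frac{\delta}{2} } \overset{(b)}{\ge} \mathbb{E}^\lambda\left[ \mathbb{E}_{B_{\delta/2}}\left[ Z_\tau^\lambda \right] \right] \overset{(c)}{=} \mathbb{E}_{B_{\delta/2}}\left[ \mathbb{E}^\lambda\left[ Z_\tau^\lambda \right] \right] \]

\[ \overset{(d)}{\ge} \mathbb{E}_{B_{\delta/2}}\left[ \frac{ 2 \exp\left( \frac{\mathbb{E}_\rho[M_\tau]^2}{2\,\mathbb{E}_\rho[U_\tau]} (1 - k) \right) }{ \ln^2\left( \frac{\mathbb{E}_\rho[U_\tau]}{(1 - \sqrt{k})\,|\mathbb{E}_\rho[M_\tau]|} \right) } \right] \ge \mathbb{E}_{B_{\delta/2}}\left[ \frac{ 2 \exp\left( \frac{\mathbb{E}_\rho[M_\tau]^2}{2\,\mathbb{E}_\rho[U_\tau]} (1 - k) \right) }{ \ln^2\left( \frac{\mathbb{E}_\rho[U_\tau]}{(1 - \sqrt{k})\,|\mathbb{E}_\rho[M_\tau]|} \right) } \,\middle|\, \tau < \infty \right] P_{B_{\delta/2}}(\tau < \infty) \]

\[ \overset{(e)}{>} \frac{4}{\delta} e^{\mathrm{KL}(\rho \,\|\, \pi)}\, P_{B_{\delta/2}}(\tau < \infty) \]

where (a) is by Lemma 6 (together with $\frac{1}{1 - \delta/2} \le 2$ for $\delta < 1$), (b) is by Prop. 9, (c) is by Tonelli’s Theorem, (d) is by Lemma 10, and (e) is by (4).
After simplification, this gives

\[ P_{B_{\delta/2}}(\tau < \infty) \le \frac{\delta}{2} \implies P_{B_{\delta/2}}(\tau = \infty) \ge 1 - \frac{\delta}{2} \qquad (5) \]

and using Theorem 7, that $P(B_{\delta/2}) \ge 1 - \frac{\delta}{2}$, concludes the proof.

□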